Abstract

The method of maximum entropy (ME) is extended to address the following
problem: once one accepts that the ME distribution is to be preferred over
all others, to what extent are distributions with lower entropy supposed to
be ruled out? Two applications are given. The first is to the theory of
thermodynamic fluctuations. The formulation is exact, covariant under changes
of coordinates, and allows fluctuations of both the extensive and the
conjugate intensive variables. The second application is to the construction
of an objective prior for Bayesian inference. The prior obtained by following
the ME method to its inevitable conclusion turns out to be a special case
($\alpha = 1$) of what are currently known as entropic priors.



1 Introduction 

The goal of inductive inference is to update a prior probability distribution to a 
posterior distribution when new information becomes available. The problem is 
to process this information in a systematic and objective way. When the infor- 
mation is in the form of constraints on the family of conceivable posterior distri- 
butions, there is one inference procedure that is singled out by requirements of 
universality, objectivity, and consistency: it is the method of maximum entropy 
(ME) [?]. The standard justification relies on interpreting entropy, through the
Shannon axioms, as a measure of the amount of uncertainty in a probability
distribution [?], but this justification is not entirely unobjectionable. A
relatively minor problem is that the Shannon axioms refer to discrete probability
distributions rather than continuous ones. A more serious one is that it is not 
clear that they provide the only way to define the notion of uncertainty. This 
has motivated a number of attempts to justify the ME method directly, without 
invoking questionable measures of uncertainty [?]. They have established the
value of the concept of entropy irrespective of any interpretation in terms of 
heat, or disorder, or uncertainty. In these approaches entropy is purely a tool 
for consistent reasoning; strictly, entropy needs no interpretation. A welcome
by-product has been that the entropy thus defined turns out to be directly ap- 
plicable to continuous distributions. In Sect. 2, as background for the subject of 
this paper, we present a brief outline of one such 'no-interpretation' approach. 
Except for one slight modification we follow Ref. [?] closely.

The main body of the paper addresses three problems. The unifying element 
is the particular form of the constraints; unlike most applications of the ME 
method the constraints are not in the form of known expectation values of 
certain variables. 

The first problem we tackle provides an extension of the ME method itself. 
Once one accepts that the maximum entropy distribution is to be preferred over 
all others, the question is to what extent, how strongly, are distributions with 
lower entropy supposed to be ruled out. In statistical mechanics the answer to 
this question is well known. It was first obtained on the basis of combinatorial
arguments in the pioneering work of Boltzmann, and it is the foundation on which
Einstein formulated his theory of fluctuations and Onsager erected his theory of
irreversible processes. More recently it was explored by Jaynes [?]. Our goal is
to show (Sect. 3) that the answer can be obtained entirely from within the ME 
framework, without appeals to combinatorics, to large systems, or other forms 
of intuitive and/or approximate arguments. 

The second problem turns out to be a special case of the first: we are
concerned with the theory of fluctuations. The starting point for the standard
theory (see e.g. Ref. [?]) is Einstein's inversion of Boltzmann's formula
$S = k \log W$ to obtain the probability of a fluctuation in the form
$W \sim \exp(S/k)$. A careful justification, however, reveals a number of
approximations which, for most
purposes, are legitimate and work very well. Later developments including 
the method of cumulants, the renormalization group, and the connection to 
non-equilibrium thermodynamics succeeded in clarifying most of the remaining 
conceptual and calculational issues. 

A re-examination of fluctuation theory from the point of view of ME is, 
however, valuable. Our general conclusion (Sect. 4) is that the ME point of view 
allows exact formulations; in fact, it is clear that deviations from the canonical 
predictions can be expected, although in general they will be negligible. Other 
advantages of the ME approach include the explicit covariance under changes 
of coordinates, the absence of restrictions to the vicinity of equilibrium or to 
large systems, and the conceptual ease with which one deals with fluctuations of 
both the extensive as well as their conjugate intensive variables. This last point 
is an important one: within the canonical distribution the extensive variables 
are random variables while the intensive ones are fixed parameters; they do
not fluctuate. There are, however, several contexts in which it makes sense 
to talk about fluctuations of the conjugate variables. We discuss the standard 
scenario of an open system that can exchange, say, energy with its environment.
An altogether different interpretation, which we will not discuss here, is to
consider fluctuations in conjugate variables as uncertainties in the estimation of
parameters [?].

The third and last problem we address is that of obtaining an objective
prior for use in Bayes' theorem. The goal is to obtain information about an
unknown quantity $\theta$ on the basis of the observed value of another quantity
$x$ and of a presumably known relation between $x$ and $\theta$. This is achieved
through Bayes' theorem, $p(\theta|x) \propto \pi(\theta)\, p(x|\theta)$. The
relation between $x$ and $\theta$ is supplied by a known model $p(x|\theta)$;
previous knowledge about $\theta$ is codified into the prior probability
$\pi(\theta)$.

The selection of a definite prior is a famously controversial issue. It has
generated an enormous literature [?]. The difficulty lies not so much in a lack
of knowledge about $\theta$, but rather in that this knowledge is sometimes vague:
it is not clear how to codify it in an objective way. Faced with this difficulty
one reasonable attitude is to admit subjectivity, and recognize that different
individuals may legitimately translate the same vague information into different
prior distributions.

An alternative attitude has been to seek some objectivity by demanding 
properties such as invariance under reparametrization or other symmetry trans- 
formations. Considerable effort has been spent searching for that special state 
of knowledge characterized by complete ignorance, and accordingly, there are
a number of proposals based on the notion of missing information [?]. In
the end, it may turn out that such a search is misguided; non-informative priors
might not exist [?].

A more positive, direct approach is to identify the information that we do in
fact possess and then find an objective way to take it into account. Remarkably, 
it turns out that the very conditions that led us to contemplate using Bayes' 
theorem constitute information that can be objectively translated into a prior 
using the ME method. The prior thus obtained (Sect. 5) turns out to be one 
particular member of the family of distributions known as "entropic priors." The 
name and the first derivation of this family for the case of discrete distributions 
are due to Skilling [?]. The generalization to the continuous case and further
elaborations by Rodriguez appear in Ref. [?]. The immediate motivation for
the present work is found in Ref. [?].

2 The logic behind the ME method 

Let our beliefs about $x \in X$ be codified in a probability distribution $m(x)$.
When new information becomes available we want to revise $m(x)$ to a posterior
distribution $p(x)$. The ME method is designed to guide us in selecting $p(x)$
when the new information is in the form of a specification of the set of
acceptable posterior distributions. The information is just a constraint on the
region in the space of all distributions where the search will be carried out.
(These constraints can, but need not, be linear.)

The selection is carried out by ranking the probability distributions
according to increasing preference. Two desirable features to be imposed on this
ranking scheme are the following. The first is a transitivity requirement: if
distribution $p_1$ is preferred over distribution $p_2$, and $p_2$ is preferred
over $p_3$, then $p_1$ is preferred over $p_3$. Such transitive rankings can be
implemented by assigning a real number $S[p]$ to each $p(x)$, in such a way that
if $p_1$ is preferred over $p_2$, then $S[p_1] > S[p_2]$. The real number $S[p]$
will be called the entropy of $p(x)$. (Thus entropies are real numbers by
design.) The selected $p$ will be that which maximizes $S[p]$. (Thus maximum
entropy.)

The problem of finding the functional form of S[p] brings us to the second 
desirable feature to be imposed on the ranking scheme. We are looking for 
a general rule of inference; the ranking procedure, the rule S[p], must be of 
universal applicability: the same rule must apply to a variety of different cases. 
If we happen to know what the selected distribution should be in a certain 
special case, then this knowledge can be used to constrain the form of S[p]. 
If enough special cases are known S[p] will be completely determined. These 
special cases - the so-called axioms - must, by their very nature, be self-evident. 

Four axioms are listed below. They all reflect the conviction that changing 
one's mind is a serious matter, that one should only update those aspects of 
one's beliefs for which hard evidence has been supplied. 

Axiom 1: Subdomain independence. If the space $X$ is divided into
non-overlapping subdomains $D_i$, and information is given about $p(x)$ for
$x \in D_i$, the selection procedure should only revise the (relative) values of
$p(x)$ for $x \in D_i$. If the evidence makes no reference to $x \notin D_i$
those values should be left unchanged. Non-overlapping subdomains are
independent. The power of this axiom lies in that the choice of subdomains $D_i$
is arbitrary; the consequence is that non-overlapping domains contribute
additively to $S[p]$.

Axiom 2: Coordinate invariance. The ranking should not depend on
the particular system of coordinates being used. The coordinates used to label
the points $x$ do not carry any information. The consequence of this axiom is
that the expression for $S[p]$ will involve coordinate invariants such as
$dx\, p(x)$ and ratios such as $p(x)/m(x)$, where the function $m(x)$ is, at
this point, an arbitrary measure.

Axiom 3: Subsystem independence. If a system is composed of two
subsystems, $x = (x_1, x_2) \in X = X_1 \times X_2$, the selection procedure
should introduce no correlations for which there was no evidence either in the
measure or in the constraints. As a consequence of this axiom a logarithm
appears in the expression for $S[p]$.

Axiom 4: Objectivity. If there is no new information there is no reason
to change one's mind: when there are no constraints the selected posterior
distribution should coincide with the prior distribution. The arbitrariness in
$m(x)$ is now removed: $m(x)$ is the prior distribution.

The overall consequence of these axioms (for a proof see [?]) is that
probability distributions should be ranked according to their entropy,

$$ S[p] = -\int dx\, p(x) \log \frac{p(x)}{m(x)} . \qquad (1) $$

Choosing the prior $m(x)$ can be tricky. When there is no information leading
us to prefer one microstate of a physical system over another we might as well
assign equal prior probability to each state. Thus it is reasonable to choose
the density of states as the prior distribution $m(x)$; the invariant $m(x)\,dx$
is the number of microstates in $dx$. This is the basis for statistical
mechanics.

Other examples of relevance to physics arise when there is no reason to prefer
one region of the space $X$ over another. Then we should assign the same prior
probability to regions of the same "volume," and we can choose
$\int_R dx\, m(x)$ to be the volume of a region $R$ in the space $X$. On the
basis of this choice of prior one can derive a considerable amount of the
formalism of quantum mechanics. This includes the "postulates" that quantum
states form a Hilbert space, that probabilities are computed through the Born
rule, and that time evolution is unitary [?].

Notice that through the measure $m(x)$ Laplace's principle of insufficient
reason still plays a role, albeit in a somewhat modified form. Thus, subjectivity 
has not been eliminated. Just as with Bayes' theorem, what is objective here 
is the manner in which information is processed, not the initial probability 
assignments. 



3 Extending the ME method 

Let $X$ be the space of microstates $x$ of a physical system ($x \in X$), and
let $m(x)\,dx$ be the number of microstates in the range $dx$. (Although in this
and in the next section we tend to drift into the language of statistical
mechanics it will be clear that the central idea is easily exported to other
contexts.) We assume that the expected values $A^\alpha$ of some $n_A$ variables
$a^\alpha(x)$ ($\alpha = 1, 2, \ldots, n_A$) are known,

$$ \langle a^\alpha \rangle = \int dx\, p(x)\, a^\alpha(x) = A^\alpha . \qquad (2) $$

This limited information will certainly not be sufficient to answer all
questions that one could conceivably ask about the system. Therefore, we make
the further assumption that the set $\{a^\alpha\}$ has not been randomly chosen,
that it has been carefully selected because previous experience indicates the
information it provides is relevant for our purposes.

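The probability distribution $p_0(x)$ that best reflects the prior information
contained in $m(x)$ updated by the information $A^\alpha$ is obtained by
maximizing the entropy (1) subject to the constraints (2). The result is

$$ p_0(x) = \frac{1}{Z}\, m(x)\, e^{-\lambda_\alpha a^\alpha(x)} , \qquad (3) $$

where the partition function $Z$ and the Lagrange multipliers $\lambda_\alpha$
are given by

$$ Z(\lambda) = \int dx\, m(x)\, e^{-\lambda_\alpha a^\alpha(x)} \quad \text{and} \quad -\frac{\partial \log Z}{\partial \lambda_\alpha} = A^\alpha . \qquad (4) $$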
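
As a concrete illustration of Eqs. (3) and (4), the following minimal sketch solves for the Lagrange multiplier in a discrete toy problem. The six-sided die, the uniform measure, and the target mean of 4.5 are illustrative assumptions that do not appear in the text, and the helper name `maxent_distribution` is hypothetical. The multiplier is found by bisection, which works here because the expected value decreases monotonically with the multiplier.

```python
import math

def maxent_distribution(values, measure, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Return (lam, probs) with probs[i] = m_i * exp(-lam * a_i) / Z and <a> = target_mean.

    Discrete analogue of Eqs. (3)-(4) with a single constraint variable a(x).
    """
    def mean_for(lam):
        weights = [m * math.exp(-lam * a) for a, m in zip(values, measure)]
        z = sum(weights)
        return sum(a * w for a, w in zip(values, weights)) / z

    # d<a>/d(lam) = -Var(a) < 0, so <a> is monotonically decreasing in lam.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > target_mean:
            lo = mid   # mean too large: a larger multiplier is needed
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    weights = [m * math.exp(-lam * a) for a, m in zip(values, measure)]
    z = sum(weights)
    return lam, [w / z for w in weights]

# Toy problem (assumed for illustration): a die whose mean is constrained to 4.5.
faces = [1, 2, 3, 4, 5, 6]
uniform_measure = [1.0] * 6
lam, probs = maxent_distribution(faces, uniform_measure, target_mean=4.5)
constrained_mean = sum(a * p for a, p in zip(faces, probs))
```

Because the target mean exceeds the unconstrained mean of 3.5, the multiplier comes out negative and the probabilities increase with the face value.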

The question we address concerns the extent to which the maximum entropy
distribution $p_0(x)$ should be preferred over other distributions with lower
entropy. Consider a family of distributions $p(x|\theta)$ which depends on a
finite, though arbitrarily large, number $n_\theta$ of parameters $\theta^i$
($i = 1, 2, \ldots, n_\theta$) and which includes $p_0(x)$ as one of its
elements. We can choose the parameters $\theta^i$ so that
$p(x|\theta = 0) = p_0(x)$.

The question about the extent to which $p(x|\theta = 0)$ is to be preferred over
$p(x|\theta \neq 0)$ can be phrased more suggestively as follows. To what extent
do we believe that the correct selection should be $\theta = 0$ rather than
$\theta \neq 0$? Thus, our question has metamorphosed into an inquiry about a
degree of belief: the probability of $\theta$, $\pi(\theta)$. In fact, we should
go further back. The original endeavor which led us to use the ME method in the
first place was a question about the probability of $x$; now we are actually
asking about the probability of "$x$ and $\theta$." We want not just $p(x)$ but
rather $p(x, \theta)$; asking about the reliability of the answer $p_0(x)$ has
led us to expand the universe of discourse from $X$ to $X \times \Theta$, where
$\Theta$ is the space of parameters $\theta$. It is remarkable that this is
precisely the kind of question the ME method is designed to answer; the strategy
is to determine the distribution $p(x, \theta)$ by maximizing an entropy subject
to whatever constraints are known to hold. To proceed we must address two
questions: first, precisely what is the form of the entropy to be maximized, and
second, what are the constraints on $p(x, \theta)$.

No definition of entropy is complete until a measure over the space in
question ($X \times \Theta$) is specified. Our starting point is that a priori
there is no known connection between $x$ and the arbitrary set of parameters
$\theta$. Since a measure must not by itself introduce correlations for which
there is no evidence, the prior measure $m(x, \theta)$ must be a product,
$m(x)\mu(\theta)$, of the known density of states $m(x)$ and a still unknown
measure $\mu(\theta)$ over the space $\Theta$. Thus, the entropy to be maximized
is

$$ \sigma[p] = -\int dx\, d\theta\, p(x,\theta) \log \frac{p(x,\theta)}{m(x)\mu(\theta)} . \qquad (5) $$

Next we incorporate the crucial piece of information that gives meaning to
the parameters $\theta$ and establishes the relation between $\theta$ and $x$:
the conditional probability $p(x|\theta)$ is known. This has two consequences.
First, the joint distribution $p(x,\theta)$ is constrained to be of the form
$\pi(\theta)p(x|\theta)$. Notice that this constraint is not in the usual form
of an expectation value. Second, the ambiguity in the choice of the measure
$\mu(\theta)$ in $\Theta$ is resolved. The family of distributions
$p(x|\theta)$ induces a natural distance in the space $\Theta$:
$d\ell^2 = g_{ij}\, d\theta^i d\theta^j$, where $g_{ij}$ is the Fisher-Rao
metric [?],

$$ g_{ij} = \int dx\, p(x|\theta)\, \frac{\partial \log p(x|\theta)}{\partial \theta^i}\, \frac{\partial \log p(x|\theta)}{\partial \theta^j} . \qquad (6) $$

Accordingly we choose $\mu(\theta) = g^{1/2}(\theta)$, where $g(\theta)$ is the
determinant of $g_{ij}$. Having identified the measure and the constraints, we
allow the ME method to take over.
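
The Fisher-Rao metric of Eq. (6) can be checked numerically in a simple case. The sketch below assumes a one-parameter Bernoulli family, $p(1|\theta) = \theta$ and $p(0|\theta) = 1 - \theta$, which is an illustrative choice not taken from the text; the integral over $x$ becomes a two-term sum and the derivative of $\log p$ is approximated by a central difference. For this family the closed form is $g(\theta) = 1/[\theta(1-\theta)]$.

```python
import math

def bernoulli(x, theta):
    """p(x|theta) for the assumed Bernoulli family, x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

def fisher_information(theta, h=1e-5):
    """One-parameter case of Eq. (6): g = sum_x p(x|theta) * (d log p / d theta)^2."""
    total = 0.0
    for x in (0, 1):
        # Central finite difference for the score d log p(x|theta) / d theta.
        score = (math.log(bernoulli(x, theta + h))
                 - math.log(bernoulli(x, theta - h))) / (2.0 * h)
        total += bernoulli(x, theta) * score ** 2
    return total

g_numeric = fisher_information(0.3)
g_exact = 1.0 / (0.3 * 0.7)
```

The invariant measure for this family is then $\mu(\theta) = g^{1/2}(\theta) = [\theta(1-\theta)]^{-1/2}$.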

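The preferred distribution $p(x, \theta)$ is chosen by varying $\pi(\theta)$ to
maximize

$$ \sigma[\pi] = -\int dx\, d\theta\, \pi(\theta)\, p(x|\theta) \log \frac{\pi(\theta)\, p(x|\theta)}{g^{1/2}(\theta)\, m(x)} . \qquad (7) $$

Assume $p(x|\theta)$ is normalized, $\int dx\, p(x|\theta) = 1$. Maximizing (7)
with respect to variations $\delta\pi(\theta)$ such that
$\int d\theta\, \pi(\theta) = 1$ yields

$$ 0 = \int d\theta \left( -\log \frac{\pi(\theta)}{g^{1/2}(\theta)} + S(\theta) - \log \zeta \right) \delta\pi(\theta) , \qquad (8) $$

where the required Lagrange multiplier has been written as $1 - \log \zeta$ and

$$ S(\theta) = -\int dx\, p(x|\theta) \log \frac{p(x|\theta)}{m(x)} . \qquad (9) $$

Therefore the probability that the value of $\theta$ should lie within the small
volume $g^{1/2}(\theta)\, d\theta$ is

$$ \pi(\theta)\, d\theta = \frac{1}{\zeta}\, e^{S(\theta)}\, g^{1/2}(\theta)\, d\theta \quad \text{with} \quad \zeta = \int d\theta\, g^{1/2}(\theta)\, e^{S(\theta)} . \qquad (10) $$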



Equation (10) is our main result. It tells us that, as expected, the preferred
value of $\theta$ is that which maximizes the entropy $S(\theta)$ because this
maximizes the scalar probability density $\exp S(\theta)$. But it also tells us
the degree to which values of $\theta$ away from the maximum are ruled out. For
macroscopic systems the preference for the ME distribution can be overwhelming.

Note that the density $\exp S(\theta)$ is a scalar function and the presence of
the Jacobian factor $g^{1/2}(\theta)$ makes Eq. (10) manifestly invariant under
changes of the coordinates $\theta^i$ in the space $\Theta$.
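
To make Eq. (10) and its coordinate invariance concrete, here is a minimal numerical sketch for an assumed Bernoulli model with $m(x) = 1$ on $x \in \{0, 1\}$ (an illustrative choice, not an example from the text). For this family $g^{1/2}(\theta) = [\theta(1-\theta)]^{-1/2}$, and the change of coordinates $\theta = \sin^2\phi$ turns the invariant volume element $g^{1/2}(\theta)\,d\theta$ into $2\,d\phi$, so the integrals can be evaluated on a uniform grid in $\phi$. The function names are hypothetical.

```python
import math

def entropy(theta):
    """S(theta) of Eq. (9) for the assumed Bernoulli model with m(x) = 1."""
    return -sum(p * math.log(p) for p in (theta, 1.0 - theta) if p > 0.0)

def prior_expectation(f, n=20000):
    """<f(theta)> under pi(theta) d(theta) = exp(S) g^(1/2) d(theta) / zeta, Eq. (10).

    Uses theta = sin(phi)^2, for which g^(1/2) d(theta) = 2 d(phi),
    and a midpoint rule on 0 < phi < pi/2.
    """
    dphi = (math.pi / 2.0) / n
    num = den = 0.0
    for k in range(n):
        theta = math.sin((k + 0.5) * dphi) ** 2
        w = math.exp(entropy(theta)) * 2.0 * dphi  # scalar density times volume element
        num += f(theta) * w
        den += w                                   # den accumulates zeta
    return num / den

mean_theta = prior_expectation(lambda t: t)
central_mass = prior_expectation(lambda t: 1.0 if 0.25 < t < 0.75 else 0.0)
```

With the factor $e^{S(\theta)}$ removed, the invariant measure alone would assign exactly one third of its mass to $0.25 < \theta < 0.75$; the entropic factor shifts additional mass toward the maximum entropy value $\theta = 1/2$.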



4 Fluctuations 

Fluctuations of the variables $a^\alpha(x)$ or of any function $b(x)$ of the
microstate $x$ are usually computed in terms of the various moments of the
canonical ME distribution $p_0(x)$ given by Eqs. (3) and (4) (see, however,
Ref. [?]). Within this context all expected values, such as the constraints
$\langle a^\alpha \rangle = A^\alpha$ and the entropy $S(A)$ itself, are fixed;
they do not fluctuate. The corresponding conjugate variables, the Lagrange
multipliers $\lambda_\alpha = \partial S/\partial A^\alpha$, do not fluctuate
either.

The standard way to make sense of $\lambda$ fluctuations is to couple the
system of interest to a second system, a bath, and allow exchanges of the
quantities $a^\alpha$. All quantities referring to the bath will be denoted by
primes: microstates $x'$, density of states $m'(x')$, variables
$a'^\alpha(x')$, etc. Even though the overall expected value
$\langle a^\alpha + a'^\alpha \rangle = A_T^\alpha$ of the combined system plus
bath is fixed, the individual expected values
$\langle a^\alpha \rangle = A^\alpha$ and
$\langle a'^\alpha \rangle = A'^\alpha = A_T^\alpha - A^\alpha$ are allowed to
fluctuate. The ME distribution $p_0(x, x')$ that best reflects the prior
information contained in $m(x)$ and $m'(x')$ updated by information on the
total $A_T$ is

$$ p_0(x,x') = \frac{1}{Z(\lambda_0)\, Z'(\lambda_0)}\, m(x)\, m'(x')\, e^{-\lambda_{0\alpha}\left(a^\alpha(x) + a'^\alpha(x')\right)} . \qquad (11) $$

But less than ME distributions are not totally ruled out; to explore the
possibility that the quantity $A_T$ is distributed between the two systems in a
less than optimal way we consider distributions $p(x, x', A)$ constrained to the
form

$$ p(x,x',A) = \pi(A)\, p(x|A)\, p(x'|A_T - A) , \qquad (12) $$

where

$$ p(x|A) = \frac{1}{Z(\lambda)}\, m(x)\, e^{-\lambda_\alpha a^\alpha(x)} . \qquad (13) $$

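The corresponding entropy is

$$ S(A) = \log Z(\lambda) + \lambda_\alpha A^\alpha , \qquad (14) $$

with $\lambda_\alpha$ and $Z(\lambda)$ given by Eq. (4). Analogous expressions
hold for the primed quantities. The formalism simplifies considerably when the
bath is large enough that exchanges of $A$ do not affect it, and $\lambda'$
remains fixed at $\lambda_0$. Then

$$ S'(A_T - A) = \log Z'(\lambda_0) + \lambda_{0\alpha}\left(A_T^\alpha - A^\alpha\right) = \text{const} - \lambda_{0\alpha} A^\alpha . \qquad (15) $$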

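The probability that the value of $A$ fluctuates into a small volume
$g^{1/2}(A)\, dA$ is given by our main result, Eq. (10),

$$ \pi(A)\, dA = \frac{1}{\zeta(\lambda_0)}\, e^{S(A) - \lambda_{0\alpha} A^\alpha}\, g^{1/2}(A)\, dA \quad \text{where} \quad g_{\alpha\beta} = -\frac{\partial^2 S(A)}{\partial A^\alpha\, \partial A^\beta} , \qquad (16) $$

and $\zeta(\lambda_0)$ is a suitably defined normalization. To the extent that
the right choice of variables has been made, Eq. (16) is exact.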
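
A small numerical sketch may help show what Eq. (16) implies in practice. It assumes a toy system that is not taken from the text: a single variable $a(x) = x$ on $x > 0$ with density of states $m(x) = x^{N-1}$. For this choice $Z(\lambda) = \Gamma(N)/\lambda^N$, $A = N/\lambda$, $S(A) = N \log A + \text{const}$ and $g(A) = N/A^2$. The code integrates Eq. (16) on a grid and returns the averages of $A$ and of $\lambda = N/A$; the function name and parameter values are hypothetical.

```python
import math

def fluctuation_moments(n_dof, lam0, steps=200000):
    """Return (<A>, <lambda>) under Eq. (16) for the assumed toy model.

    Toy model: m(x) = x^(N-1), a(x) = x, so S(A) = N log A + const,
    lambda(A) = N / A and g(A) = -S''(A) = N / A^2.
    """
    a_peak = n_dof / lam0                        # A_0: maximum of exp(S - lam0*A)
    log_peak = n_dof * math.log(a_peak) - lam0 * a_peak
    da = 40.0 * a_peak / steps                   # integrate far into the tail
    norm = sum_a = sum_lam = 0.0
    for k in range(steps):
        a = (k + 0.5) * da                       # midpoint rule
        scalar = math.exp(n_dof * math.log(a) - lam0 * a - log_peak)
        jacobian = math.sqrt(n_dof) / a          # g^(1/2)(A)
        w = scalar * jacobian * da
        norm += w
        sum_a += a * w
        sum_lam += (n_dof / a) * w
    return sum_a / norm, sum_lam / norm

mean_a, mean_lam = fluctuation_moments(n_dof=5, lam0=1.0)
```

In this toy model $\pi(A) \propto A^{N-1} e^{-\lambda_0 A}$, so $\langle A \rangle = N/\lambda_0$ happens to coincide with $A_0$, while $\langle \lambda \rangle = N\lambda_0/(N-1)$ differs from $\lambda_0$ by a correction of relative order $1/N$ that becomes negligible for large systems.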

An important difference with the usual theory stems from the presence of
the Jacobian factor $g^{1/2}(A)$. This is required by coordinate invariance and
can lead to small deviations from the canonical predictions. The quantities
$\langle \lambda_\alpha \rangle$ and $\langle A^\alpha \rangle$ may be close but
will not in general coincide with the quantities $\lambda_{0\alpha}$ and
$A_0^\alpha$ at the point where the scalar probability density attains its
maximum. When this maximum is very sharp and in its vicinity the Jacobian can be
considered constant, the usual results [?] follow. The remaining difficulties
are purely computational and of the kind that can in general be tackled
systematically using the method of steepest descent to evaluate the appropriate
generating function.

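Since we are not interested in variables referring to the bath we can integrate
Eq. (12) over $x'$, and use the distribution $p(x,A) = \pi(A)\, p(x|A)$ to
compute various moments. As an example, the correlation between
$\delta\lambda_\alpha = \lambda_\alpha - \langle \lambda_\alpha \rangle$ and
$\delta a^\beta = a^\beta - \langle A^\beta \rangle$ or
$\delta A^\beta = A^\beta - \langle A^\beta \rangle$ is

$$ \langle \delta\lambda_\alpha\, \delta a^\beta \rangle = \langle \delta\lambda_\alpha\, \delta A^\beta \rangle = -\delta_\alpha^\beta + \left(\lambda_{0\alpha} - \langle \lambda_\alpha \rangle\right)\left(A_0^\beta - \langle A^\beta \rangle\right) . \qquad (17) $$

When the differences $\lambda_{0\alpha} - \langle \lambda_\alpha \rangle$ or
$A_0^\beta - \langle A^\beta \rangle$ are negligible one obtains the usual
expression, $\langle \delta\lambda_\alpha\, \delta a^\beta \rangle \approx -\delta_\alpha^\beta$.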

5 Entropic priors



The last problem we address is that of obtaining a prior $\pi(\theta)$ for use
in Bayes' theorem, $p(\theta|x) \propto \pi(\theta)\, p(x|\theta)$. The
traditional approach has been to attempt to determine or at least to constrain
$\pi(\theta)$ by requiring that it be non-informative, that it satisfy
coordinate invariance, and so on. The seemingly innocuous but fruitful new idea
proposed by Rodriguez [?] is to focus attention on $p(x,\theta)$ instead of
$\pi(\theta)$. One could well wonder whether this makes any difference. After
all, once $p(x|\theta)$ is known, $p(x,\theta)$ can be calculated from
$\pi(\theta)$ and vice versa.

It makes a huge difference. The selection of a preferred distribution using
the ME method demands that one specify in which space the search will be
conducted. Being a consequence of the product rule, Bayes' theorem requires
that $p(x,\theta)$ be defined and that assertions such as "$x$ and $\theta$" be
meaningful. The relevant universe of discourse is neither $X$ nor $\Theta$, but
the product $X \times \Theta$.

The complete specification of the space $X \times \Theta$ requires a measure
$m(x, \theta)$. At this point we do not know anything about the variables
$\theta$; they are totally arbitrary. To the extent that no relation between $x$
and $\theta$ is known, the measure must be the product $m(x)\mu(\theta)$ of the
separate measures in the spaces $X$ and $\Theta$. Indeed, the distribution that
maximizes

$$ \sigma[p] = -\int dx\, d\theta\, p(x,\theta) \log \frac{p(x,\theta)}{m(x)\mu(\theta)} , \qquad (18) $$

is $p(x, \theta) \propto m(x)\mu(\theta)$; it is such that data about $x$ tells
us nothing about $\theta$. In what follows we assume that $m(x)$ is known; this
is part of understanding what data it is that has been collected. The measure
$\mu(\theta)$ remains undetermined.

Next we incorporate the crucial piece of information: in order to infer
something about $\theta$ on the basis of a measurement of $x$, a relation
between $x$ and $\theta$ must exist. The relation is supplied by the model
$p(x|\theta)$. This constrains the joint distribution $p(x, \theta)$ to be of
the form $\pi(\theta)\, p(x|\theta)$ and removes the ambiguity in the choice of
$\mu(\theta)$. As mentioned before, there is a natural choice
$\mu(\theta) = g^{1/2}(\theta)$, where $g(\theta)$ is the determinant of the
Fisher-Rao metric $g_{ij}$. Having identified the space, the measure, and the
constraints, the ME method gives the probability $\pi(\theta)$ that the value of
$\theta$ should lie within the small volume $g^{1/2}(\theta)\, d\theta$. It is
our previous main result, Eq. (10),

$$ \pi(\theta)\, d\theta \propto e^{S(\theta)}\, g^{1/2}(\theta)\, d\theta . \qquad (19) $$

It is remarkable that the ingredients that have been used are precisely those
that led us to consider using Bayes' theorem in the first place. Once the model
is known, which means that the data space $X$, its measure $m(x)$, and the
conditional distribution $p(x|\theta)$ are given, the prior probability
$\pi(\theta)$ is unambiguously determined.

We emphasize that $\pi(\theta)$ in Eq. (19) is not the least informative
distribution; it is the distribution after we learn $p(x|\theta)$. The
distribution before we learn $p(x|\theta)$ is $\mu(\theta)$. We do not know it;
this is truly noninformative.

No doubt the reader recognizes that essentially the same argument has been
given twice, first in Sect. 3 and then here. There is a reason for this repeti- 
tion. It was not a priori obvious (at least to this author) that there could have 
existed a relation between, say, the theory of thermodynamic fluctuations and 
the problem of selecting priors in Bayesian inference. They are most definitely 
not the same problem; the meanings of the various symbols and the motivations 
driving our interests in these questions do not coincide. It was therefore not 
at all clear that exactly the same mathematical formalism could provide the 
solution to both. Two verbal justifications, rather than just one, were needed. 

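The prior $\pi(\theta)$ in Eq. (19) is a member of the family of distributions
labelled by the real parameter $\alpha$,

$$ \pi(\theta, \alpha) = \frac{1}{\zeta(\alpha)}\, e^{\alpha S(\theta)}\, g^{1/2}(\theta) , \quad \text{with} \quad \zeta(\alpha) = \int d\theta\, g^{1/2}(\theta)\, e^{\alpha S(\theta)} , \qquad (20) $$

which are known as entropic priors [?]. The ME approach has unambiguously
selected the $\alpha = 1$ member. Indeed, it is easy to check that values
$\alpha \neq 1$ do not maximize the $\sigma$ entropy,
$\sigma[\pi(\theta, 1 + \epsilon)] < \sigma[\pi(\theta, 1)]$.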
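
The check mentioned above can be sketched numerically. Substituting Eq. (20) into Eq. (7) gives $\sigma[\pi(\theta,\alpha)] = (1-\alpha)\langle S \rangle_\alpha + \log\zeta(\alpha)$, whose derivative with respect to $\alpha$ is $(1-\alpha)$ times the variance of $S$ under $\pi(\theta,\alpha)$, so the maximum sits at $\alpha = 1$. The code below evaluates this expression for an assumed Bernoulli model with $m(x) = 1$ (an illustrative choice not taken from the text), using the coordinate $\theta = \sin^2\phi$ in which $g^{1/2}(\theta)\,d\theta = 2\,d\phi$. The function names and the list of trial values are hypothetical.

```python
import math

def entropy(theta):
    """S(theta) of Eq. (9) for the assumed Bernoulli model with m(x) = 1."""
    return -sum(p * math.log(p) for p in (theta, 1.0 - theta) if p > 0.0)

def sigma_of_alpha(alpha, n=5000):
    """sigma[pi(theta, alpha)] = (1 - alpha) <S>_alpha + log zeta(alpha)."""
    dphi = (math.pi / 2.0) / n
    zeta = s_sum = 0.0
    for k in range(n):
        theta = math.sin((k + 0.5) * dphi) ** 2
        s = entropy(theta)
        w = math.exp(alpha * s) * 2.0 * dphi   # exp(alpha S) g^(1/2) d(theta)
        zeta += w
        s_sum += s * w
    mean_s = s_sum / zeta
    return (1.0 - alpha) * mean_s + math.log(zeta)

trial_alphas = [0.0, 0.5, 0.9, 1.0, 1.1, 2.0, 5.0]
best_alpha = max(trial_alphas, key=sigma_of_alpha)
```

Among the trial values the largest $\sigma$ is attained at $\alpha = 1$, in line with the inequality stated in the text.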

The $\alpha = 1$ entropic prior has, in the past, led to manifestly reasonable
results. Examples include the entropic prior for the family of Gaussians [?],
and the distribution dual to the Maxwell-Boltzmann distribution [?]. The
justifications given for these two cases are totally independent of each other
and of ours; both are instances of Jeffreys' prior for scale parameters [?]. On
the other hand, values $\alpha \neq 1$ have also been used. To investigate this
further we consider experiments that can be repeated.

Experiments need not be repeatable. Assume, however, that successive
repetitions are possible and that they happen to be independent. Suppose, to
be specific, that the experiment is performed twice so that the space of data
$X \times X = X^2$ consists of the possible outcomes $x_1$ and $x_2$. Suppose
further that $\theta$ is not a "random" variable; the value of $\theta$ is fixed
but unknown. Then the joint distribution in the space $X^2 \times \Theta$ is

$$ p(x_1, x_2, \theta) = \pi^{(2)}(\theta)\, p(x_1, x_2|\theta) = \pi^{(2)}(\theta)\, p(x_1|\theta)\, p(x_2|\theta) , \qquad (21) $$

and the appropriate $\sigma$ entropy is

$$ \sigma^{(2)}[\pi] = -\int dx_1\, dx_2\, d\theta\, p(x_1, x_2, \theta) \log \frac{p(x_1, x_2, \theta)}{\left[g^{(2)}(\theta)\right]^{1/2} m(x_1)\, m(x_2)} , \qquad (22) $$

where $g^{(2)}(\theta)$ is the determinant of the Fisher-Rao metric for
$p(x_1, x_2|\theta)$. From Eq. (6) it follows that $g^{(2)}_{ij} = 2 g_{ij}$ so
that $g^{(2)}(\theta) = 2^d g(\theta)$, $d$ being the dimension of $\theta$.
Maximizing $\sigma^{(2)}[\pi]$ subject to $\int d\theta\, \pi^{(2)}(\theta) = 1$
we get

$$ \pi^{(2)}(\theta) = \frac{1}{\zeta^{(2)}} \left[g^{(2)}(\theta)\right]^{1/2} e^{S^{(2)}(\theta)} = \frac{1}{\zeta(2)}\, g^{1/2}(\theta)\, e^{2 S(\theta)} , \qquad (23) $$

where $S^{(2)}(\theta) = 2 S^{(1)}(\theta) = 2 S(\theta)$ is the entropy of
$p(x_1, x_2|\theta)$, $\zeta^{(2)}$ is the corresponding normalization, and
$\zeta(\alpha)$ is defined in Eq. (20). The generalization to $n$ repetitions of
the experiment, with data space $X^n$, is immediate: the ME prior
$\pi^{(n)}(\theta)$ is obtained by replacing $S^{(2)}(\theta)$ with
$S^{(n)}(\theta) = n S^{(1)}(\theta)$. The coefficient in front of
$S^{(n)}(\theta)$ remains $\alpha = 1$ and the prior $\pi^{(n)}(\theta)$ differs
from $\pi^{(1)}(\theta)$. This is puzzling. Do we have to revise our prior as
more data comes in? In fact, for large $n$ the prior $\pi^{(n)}(\theta)$ above
becomes manifestly wrong: the exponential preference for the value of $\theta$
that maximizes $S^{(1)}(\theta)$ becomes so pronounced that no amount of data to
the contrary can successfully overcome its effect.
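
The sharpening of $\pi^{(n)}(\theta) \propto g^{1/2}(\theta)\, e^{n S(\theta)}$ with growing $n$ can be seen in a short numerical sketch. It again assumes a Bernoulli model with $m(x) = 1$, an illustrative choice not taken from the text, and uses the coordinate $\theta = \sin^2\phi$ in which $g^{1/2}(\theta)\,d\theta = 2\,d\phi$. The function name `prior_std` and the chosen values of $n$ are hypothetical.

```python
import math

def entropy(theta):
    """Single-experiment entropy S(theta) for the assumed Bernoulli model."""
    return -sum(p * math.log(p) for p in (theta, 1.0 - theta) if p > 0.0)

def prior_std(n_rep, grid=20000):
    """Standard deviation of theta under the prior proportional to g^(1/2) exp(n S)."""
    dphi = (math.pi / 2.0) / grid
    norm = second = 0.0
    for k in range(grid):
        theta = math.sin((k + 0.5) * dphi) ** 2
        w = math.exp(n_rep * entropy(theta)) * dphi  # constant factors cancel in the ratio
        norm += w
        second += (theta - 0.5) ** 2 * w             # the mean is 1/2 by symmetry
    return math.sqrt(second / norm)

spreads = {n: prior_std(n) for n in (1, 10, 100)}
```

The spread shrinks roughly like $1/(2\sqrt{n})$ for large $n$, so for many repetitions this prior alone would pin $\theta$ near $1/2$ before any data are examined.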

Repeatable experiments present us with a problem, but how do we deny
preferred status to $\alpha = 1$ without simultaneously challenging the ME
principle itself? There is one way out of this dilemma. Readers of Jaynes' work
will surely recognize the following argument: we have been conducting our search
with the wrong constraint. There is something we know about repeatable
experiments that we have not incorporated into the ME procedure above. I propose
it is this: when we say an experiment can be repeated, say, twice, $n = 2$, we
actually know more than just
$p(x_1, x_2|\theta) = p(x_1|\theta)\, p(x_2|\theta)$. We also know that
forgetting or discarding the value of, say, $x_2$ yields an experiment that is
totally indistinguishable from the single, $n = 1$, experiment. This additional
information is quantitatively expressed by the constraint
$\int dx_2\, p(x_1, x_2, \theta) = p(x_1, \theta)$, or equivalently

$$ \int dx_2\, \pi^{(2)}(\theta)\, p(x_1|\theta)\, p(x_2|\theta) = \pi^{(1)}(\theta)\, p(x_1|\theta) , \qquad (24) $$

which leads to $\pi^{(2)}(\theta) = \pi^{(1)}(\theta)$. In the general case we
get the manifestly reasonable result
$\pi^{(n)}(\theta) = \pi^{(n-1)}(\theta) = \ldots = \pi^{(1)}(\theta)$; the
undesired dependence on $n$ has been eliminated.

The conclusion is that our result Eq. (19) stands: $\alpha = 1$ is the default
value. Unless there is positive evidence to the contrary, the entropic prior
with $\alpha = 1$ should be preferred. But, of course, the results of Sect. 3
apply here too. The preference for maximum entropy is not absolute:
$\alpha = 1$ is just the maximum $\sigma$ distribution, and values of $\alpha$
corresponding to less than maximum $\sigma$ are not totally ruled out.

6 Final remarks 

The method of maximum entropy has been extended to give a quantitative de- 
termination of the degree to which distributions with lower entropy are ruled 
out. The same idea was used to extend the theory of thermodynamic fluctua- 
tions and in the construction of priors for Bayesian inference. That a connection 
between these two historically independent topics should at all exist is in itself 
quite remarkable. 

We conclude with a comment on the reliability of using entropy as a tool 
for reasoning. There are several reasons why the ME method could lead to an 
absurd answer. One possibility is that there is relevant prior information that 
remains unidentified. Another possible reason for failure is a wrong choice of 
variables. Choosing the right variables is perhaps the most serious difficulty
in statistical mechanics; in fact, it takes many years of indoctrination before it 
is obvious that the Cooper pair wave function is the right variable to describe 
superconductivity. 

These two possibilities, failure to identify the correct constraints or to
identify the correct variables, do not reflect a flaw of the ME method itself.
Of course, it is conceivable that it is the ME axioms that fail, or that real
numbers are not the right way to measure entropy, or even worse, that there is no
universal set of rules for processing information. But one need not be overly
cautious in this last respect. It is clear that the ME method is applicable to a 
vast range of problems, and at this point, there are absolutely no signs that the 
exploration of this territory is anywhere near completion. 

Acknowledgments. I am indebted to C. C. Rodriguez and D. A. Davis for
very valuable discussions. 